STA 141B Final Project
Analysis of Happiness

Shang Jiang : 915497630

   Jiahui Li : 915544392

Songkai Xiao : 915576357

Motivation

Happiness is increasingly important to both individuals and countries.

Being happy is not just about feeling good. Research shows that it also makes us healthier, more productive and nicer.

Happiness is increasingly considered the proper measure of social progress and the goal of public policy

Do you know which country is the happiest in the world?

Do you know what's the secret of the happiness, nationally speaking?

You want the answers? Follow us!

We are lucky enough to find a rank of global happiness in http://worldhappiness.report/, which motivates us to gain more insights into this interesting topic.

Related Researches shows that:

Being happy is not just about feeling good. Research shows that it also makes us healthier, more productive – and nicer

Happiness is increasingly considered the proper measure of social progress and the goal of public policy.

And this project are motivated by these questions:

  • Which countries and which regions have the highest happiness values?
  • What elements does the happiness value of a country or a region depend on? What’s the relationship between these elements and the happiness value?
  • Can these countries be separated into different groups? Are these groups distinct from each other? If yes, what's the characteristics of these groups?_
In [2]:
import pandas as pd
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotnine as gg
from scipy import stats, integrate
from mpl_toolkits.mplot3d import Axes3D
sns.set(color_codes=True)
%matplotlib inline
# matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12,8)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from plotly import __version__
import cufflinks as cf
from sklearn.preprocessing import scale
import warnings
warnings.filterwarnings('ignore')

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as stats

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("brown")
nltk.download("wordnet")
nltk.download("stopwords")

import requests
import requests_ftp
import requests_cache
import lxml
import lxml.html as lx

from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud 
# plt.style.use('ggplot')
[nltk_data] Downloading package punkt to /Users/lijh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lijh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /Users/lijh/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lijh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/lijh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [6]:
hp15 = pd.read_csv("./world-happiness-report/2015.csv")
hp16 = pd.read_csv("./world-happiness-report/2016.csv")
hp17 = pd.read_csv("./world-happiness-report/2017.csv")
raw = pd.read_csv("raw_data.csv")
In [7]:
hp15 = hp15.set_index('Country')
hp16 = hp16.set_index('Country')
hp17 = hp17.set_index('Country')
hp17 = hp17.rename(columns = {'Whisker.high':'Upper Confidence Interval','Whisker.low':'Lower Confidence Interval',
                              'Economy..GDP.per.Capita.':'Economy (GDP per Capita)',
                              'Health..Life.Expectancy.':'Health (Life Expectancy)',
                              'Trust..Government.Corruption.':'Trust (Government Corruption)',
                              'Dystopia.Residual':'Dystopia Residual'})
hp17 = hp17.join(hp16.Region)
raw['Happiness Score'] = raw['Life Ladder']
In [8]:
hpall = raw['Happiness Score'].dropna()
gdpall = raw['GDP per capita'].dropna()
socialall = raw['Social support'].dropna()
freedomall = raw ['Freedom to make life choices'].dropna()
healthyall = raw['Healthy life expectancy at birth'].dropna()
corruptionall = raw['Perceptions of corruption'].dropna()
generosityall = raw['Generosity'].dropna()

1. Data

How to find variables contributing to happiness?

Our strategy:

  1. Use a web scraping to get some terms related to happiness.
  2. Combine it with our domain knowledge to delete some misleading terms.
  3. Use research result to add some extra data.
  4. Find data from the internet.
  5. check data and clean data.

1.1 Variable Analysis:

​

1.1.1. Web Crawler:

We scrape the content from wiki (url: https://en.wikipedia.org/wiki/World_Happiness_Report) And use a word cloud to analysis which factor mainly influence the happiness. ​

In [9]:
def wordnet_pos(tag):
    """Map a Brown POS tag to a WordNet POS tag."""
    
    table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}

    return table.get(tag[0], wordnet.NOUN)

def word_clear(text):
    blob = TextBlob(text)
    stopwords_ = stopwords.words("english")
    new_text = " ".join(w for w in blob.words if w.lower() not in stopwords_)
    blob = TextBlob(new_text)
    tags = [wordnet_pos(x[1]) for x in blob.pos_tags]
    new_text = " ".join(x.lemmatize(t) for x, t in zip(blob.words, tags))
    blob = TextBlob(new_text)
    return(blob)

def visulization(text,max_word=30):
    wordcloud = WordCloud(
        background_color= 'white',
        stopwords = stopwords.words("english"),
        scale= 10,
        max_words = max_word,
        max_font_size = 40,
    )
    wordcloud = wordcloud.generate(str(text))
    plt.figure(1,figsize=(12,12))
    plt.axis('off')
    plt.imshow(wordcloud) 
    
def web_craw(url,url_base):
    response = requests.get(url)
    response.raise_for_status()
    doc = response.text
    html = lx.fromstring(doc,base_url = url_base)
    html.make_links_absolute()
    links = html.xpath("//p")
    i = 0 
    text_ = [[]] * len(links)
    for t in links:
        text_[i] = t.text_content().lower()
        i = i + 1 
    text = ''.join(text_)
    #text = word_clear(text)
    return(text)
In [10]:
url = 'https://en.wikipedia.org/wiki/World_Happiness_Report'
url_base = 'https://en.wikipedia.org/'

text = web_craw(url,url_base)
visulization(word_clear(text),max_word=30)

1.1.2. Domain knowledge

Using domain knowledge to delete some unrelated terms and we can see the new word cloud. Which implies the economic, social, health, policy and other factors will influence the happiness level.

In [11]:
text_ = text.split()
text_ = [x.replace('chapter','') for x in text_]
text_ = [x.replace('written','') for x in text_]
text_ = [x.replace('report','') for x in text_]
text_ = [x.replace('change','') for x in text_]
text_ = [x.replace('world','') for x in text_]
text_ = [x.replace('nation','') for x in text_]
text_ = [x.replace('data','') for x in text_]
text_ = [x.replace('people','') for x in text_]
text_ = [x.replace('measure','') for x in text_]
text_ = [x.replace('one','') for x in text_]
text_ = [x.replace('include','') for x in text_]
text_ = [x.replace('use','') for x in text_]
text_ = [x.replace('finding','') for x in text_]
text_ = [x.replace('well','') for x in text_]
text_ = [x.replace('factor','') for x in text_]
text_ = [x.replace('subjective','') for x in text_]
text__ = ' '.join(text_)
visulization(text__,max_word=15)

1.1.3 Research results:

After reading some research papers related to happiness,, we decide to add two variables: freedom and extra money they have.

1.2 Data Source and Data Cleaning:

1.2.1. What do we use to describe the factor and Where do we find our data and the data description?

  • We use Healthy life expectancy to describe health factor

    Calculated by the authors based on data from the WHO, WDI, and statistics published in journal articles. http://worldhappiness.report/

1.2.2. What is the problem of our data and what did we do to fix them?

The problem is there are a lot of missing values and since we have plenty of data, we just drop them. Also, the price of GDP is not consistent, we use extra data from WDI to transparent them to a constant 2017 international dollar price.

2.Exploratory Data Analysis

4.1 Distribution of Variables

In [12]:
sns.set(rc={'figure.figsize':(20,20)})
plt.subplot(421)
sns.distplot(hpall)
plt.subplot(422)
sns.distplot(gdpall)
plt.subplot(423)
sns.distplot(socialall)
plt.subplot(424)
sns.distplot(freedomall)
plt.subplot(425)
sns.distplot(healthyall)
plt.subplot(426)
sns.distplot(corruptionall)
plt.subplot(427)
sns.distplot(generosityall)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c0a37ce48>
  • The happiness score is mainly concentrated between 3.5 to 7.5. Almost no people in a country think that they are extremely unhappy or happy.
  • GDP is mainly concentrated between 0 to 40000. And it is severely right-skewed.
  • The social support is mainly concentrated between 0.7 to 1.0. People in lots of countries are satisfied with their social support
  • The Freedom to make like choices is concentrated in 0.5 to 0.9. People in most countries think they have freedom to make life choice, but still in a few countries, people do not think they have enough freedom to make life choice.
  • For the healthy life expectancy, there are almost three peaks, one is at 53, one is at 64, on is at 72.
  • The perceptions of corruption is mainly concentrated between 0.6 to 1.0. The higher the perceptions of corruption is, the worse the trust of government. So, people in most countries are not trust their government so much.

Since GPD is severely right-skewed, which will effect our analysis, so we make transformation, taking log, on it.

In [14]:
raw['Log GDP per capita']= [math.log(x) for x in raw['GDP per capita']]
gdpall = raw['Log GDP per capita'].dropna()

4.2 Distribution of Variables VS Year

In [15]:
sns.set(rc={'figure.figsize':(35,35)})


plt.subplot(421)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Log GDP per capita", data=raw)
ax.set(title = "Log GDP per capita By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(422)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Happiness Score", data=raw)
ax.set(title = "Happiness Score By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(423)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Social support", data=raw)
ax.set(title = "Social support By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(424)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Freedom to make life choices", data=raw)
ax.set(title = "Freedom to make life choices By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(425)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Healthy life expectancy at birth", data=raw)
ax.set(title = "Healthy life expectancy at birth By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(426)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Perceptions of corruption", data=raw)
ax.set(title = "Perceptions of corruption By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
plt.subplot(427)
sns.set(font_scale=3) 
ax = sns.violinplot(x="year", y="Generosity", data=raw)
ax.set(title = "Generosity By Year", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)
ax.tick_params(labelsize=30)
  • From the plot, we found that the data of 2005 is abnormal, the distribution of which is much different from those of other years, so we take a careful look on the data of year 2015, to check if they should remain in the data we use to the further analysis.
In [16]:
pd.DataFrame(raw.year.value_counts()).T
Out[16]:
2011 2014 2015 2012 2016 2013 2010 2009 2008 2007 2006 2005
year 146 145 143 142 141 137 124 114 110 102 89 27

The data of 2005 only record 27 countries, which is much less than the othe years, so we guess the abnormal distribution is because the size of data is smaller, not because of other experimental or recording errors.

We plot the distribution of every variable of 2005 with that of whole data to have a further check.

In [17]:
data05 = raw[raw.year==2005]

sns.set(rc={'figure.figsize':(20,20)})
plt.subplot(421)
ax = sns.distplot(hpall)
ax = sns.distplot(data05['Happiness Score'].dropna())
plt.subplot(422)
ax = sns.distplot(gdpall)
ax = sns.distplot(data05['Log GDP per capita'].dropna())
plt.subplot(423)
ax = sns.distplot(socialall)
ax = sns.distplot(data05['Social support'])
plt.subplot(424)
ax = sns.distplot(freedomall)
ax = sns.distplot(data05['Freedom to make life choices'].dropna())
plt.subplot(425)
ax = sns.distplot(healthyall)
ax = sns.distplot(data05['Healthy life expectancy at birth'].dropna())
plt.subplot(426)
ax = sns.distplot(corruptionall)
ax = sns.distplot(data05['Perceptions of corruption'].dropna())
plt.subplot(427)
ax = sns.distplot(generosityall)
ax = sns.distplot(data05['Generosity'].dropna())
  • From the plot, we can find the distribution of data of 2005 year is similar to the higher part of the distribution of whole data. When we take a closer look of the data, we find the countries included in 2005 data are mainly developed country, maybe because the report only collect the happiness score of some developed countries in the first years.

4.3 Distribution of Variables VS Region

In [18]:
raw.rename(columns={'country':'Country'},inplace =True)
raw = raw.set_index('Country')
raw = raw.join(hp15.Region)

sns.set(rc={'figure.figsize':(30,30)})

plt.subplot(321)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Happiness Score', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Happiness Score By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16) 

plt.subplot(322)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Log GDP per capita', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Log GDP per capita By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16) 

plt.subplot(323)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Social support', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Social support By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16) 

plt.subplot(324)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Healthy life expectancy at birth', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Healthy life expectancy at birth By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16) 

plt.subplot(325)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Freedom to make life choices', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Freedom By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16)

plt.subplot(326)
sns.set(font_scale=3) 
ax = sns.boxplot(x = 'Perceptions of corruption', y = 'Region',data=raw)        #draw the barplot
ax.set(title = "Corruption By Region", xlabel = "", ylabel = "")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
ax.tick_params(labelsize=16)
  • Every variables differ much with region.
  • Australia and New Zealand, Western Europe and North America have large happiness score, large GDP,large social support, large healthy life expectancy, large freedom and small perception of corruption.
  • Sub-Saharan Africa and Southern Asia have small happiness score, small GDP, small social support, small healthy life expectancy. But the freedom of these two region is not very small.

4.4 Relation between Happiness with Other Variables

In [19]:
def drawworld(df, name, year):
    data = dict(type = 'choropleth', 
    locations = df.index,
    locationmode = 'country names',
    colorscale = [[0,"rgb(153, 51, 51)"],[0.5,"rgb(240, 240, 240)"],[1,"rgb(255, 153, 204)"]],
    z = df[name], 
    text = df.index,
    colorbar = {'title':str(name)})
    layout = dict(title = 'World '+str(name) + ' in ' +str(year), 
                  geo = dict(showframe = False, projection = {'type': 'Mercator'}))
    choromap3 = go.Figure(data = [data], layout=layout)
    iplot(choromap3)
In [20]:
data16 = raw[raw.year ==2016]
drawworld(hp16,'Happiness Score',2016)
In [21]:
drawworld(data16,'Log GDP per capita',2016)
In [22]:
drawworld(data16,'Social support',2016)
  • From the world distribution, we can find that the distributions of Happiness Score, GDP, Social Support, Healthy life expectancy are similar, so obviously, they have not weak correaltion between happiness score with GDP, Social Support and Healthy life expectancy.
In [23]:
data16_new=data16.iloc[:,:-1].dropna()
data16_new_hp = data16_new['Happiness Score']
data = pd.DataFrame(scale(data16_new),index=data16_new.index, columns= data16_new.columns)
data['Happiness Score'] = data16_new_hp
data = data.sort_values(['Happiness Score'],ascending=True)


cf.go_offline()
data[['Happiness Score','Log GDP per capita','Social support','Healthy life expectancy at birth']].iplot(kind='spread')
In [24]:
cf.go_offline()
data[['Happiness Score','Freedom to make life choices','Generosity']].iplot(kind='spread')
In [25]:
cf.go_offline()
data[['Happiness Score','Perceptions of corruption']].iplot(kind='spread')
  • GDP, Social Support, Healthy life expectancy are incresing as happiness score increasing, so we think these three variables have strong positive relationship with happiness score.
  • Freedom and Generosity are increasing as happiness score too, but not so significant, so we think these two variables have moderate positive realtionship with happiness score.
  • Perceptions of corruption are decreasing as happiness score when the happiness score is high, so we think this variable has negative relationship with happiness score.

3.Modeling:

3.1. The pairwise plot shows there is significant linear relationship and thus we fix a linear regression model with R-squared 0.728.

In [4]:
model_data = pd.read_csv('final_model_data.csv')
model_data = model_data.dropna()
pair = model_data[['Life Ladder', 'Log GDP per capita','Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices','Perceptions of corruption']]
In [5]:
sns.pairplot(pair,kind="reg",diag_kind="kde")
Out[5]:
<seaborn.axisgrid.PairGrid at 0x1c0f128b00>
In [26]:
y = pd.DataFrame(model_data['Life Ladder'])
x = pd.DataFrame(model_data.iloc[0:,3:9])
x = sm.add_constant(x)
x= x[['const', 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices',
        'Perceptions of corruption']]
model = sm.OLS(y, x).fit()
model.summary()
Out[26]:
OLS Regression Results
Dep. Variable: Life Ladder R-squared: 0.728
Model: OLS Adj. R-squared: 0.727
Method: Least Squares F-statistic: 666.5
Date: Sun, 18 Mar 2018 Prob (F-statistic): 0.00
Time: 20:57:17 Log-Likelihood: -1116.4
No. Observations: 1249 AIC: 2245.
Df Residuals: 1243 BIC: 2276.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -1.6485 0.206 -8.013 0.000 -2.052 -1.245
Log GDP per capita 0.3120 0.030 10.556 0.000 0.254 0.370
Social support 2.4001 0.196 12.227 0.000 2.015 2.785
Healthy life expectancy at birth 0.0300 0.004 7.985 0.000 0.023 0.037
Freedom to make life choices 1.2781 0.141 9.049 0.000 1.001 1.555
Perceptions of corruption -0.6865 0.107 -6.405 0.000 -0.897 -0.476
Omnibus: 12.893 Durbin-Watson: 0.621
Prob(Omnibus): 0.002 Jarque-Bera (JB): 15.001
Skew: -0.170 Prob(JB): 0.000553
Kurtosis: 3.416 Cond. No. 889.

3.2. Model Interpretation:

  • P-value of all the parameters are zero, implies it is very significant. and at almost the same importance.
  • For log(GDP), social support, healthy life expectancy, and freedom to make choice, the parameters is positive which shows they are positively related.
  • For perceptions of corruption, the parameters is negative which shows they are negatively related.

3.2.3. Model diagnosis

In [31]:
plt.figure(figsize=(20,8)) 
plt.subplot(121)
plt.scatter(model.fittedvalues,y['Life Ladder'],)
plt.title('Fitted values V.S Happiness Score')

plt.subplot(122)
plt.plot(model.resid)
plt.title('Residual plot')
Out[31]:
Text(0.5,1,'Residual plot')

4.Clustering

From the regreesion result, it appears these variables we choose(i.e Log GDP per capita, Social support, Healthy life expectancy at birth, Freedom to make life choices, Perception of corruption) contribute greatly to the happiness score.

The following question comes naturally:

(i) Can these countries be separated into different groups? Are these groups distinct from each other?

(ii) If yes, what's the characteristics of these groups?

Now, the method K-Means Clustering plays an important role to answer these questions.

4.1 Data Aggregation

Since some variables of some countries in our data are missing for some reasons (not recorded, not covered, etc.). But we still want to fully utilize these imcomplete data, here we get the mean of each variable of each country given the data we have; and take the mean as the primary index of these variables.

In [28]:
data = pd.read_csv('final_model_data.csv')
data.set_index('country', inplace = True)
data.dropna(inplace = True)

country_index = data[['Log GDP per capita', 'Social support','Healthy life expectancy at birth', 'Freedom to make life choices',
       'Perceptions of corruption']]

country_index_year_average = country_index.groupby('country').agg(np.mean)

X = country_index_year_average
X.shape
Out[28]:
(155, 5)
In [29]:
data.columns
Out[29]:
Index(['year', 'Life Ladder', 'Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Generosity', 'Perceptions of corruption'],
      dtype='object')
In [30]:
X.head()
Out[30]:
Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Perceptions of corruption
country
Afghanistan 7.447978 0.520064 48.727537 0.557993 0.812616
Albania 9.172559 0.732704 67.999568 0.612438 0.857864
Algeria 9.481694 0.824816 63.891903 0.558112 0.664049
Angola 8.816016 0.737973 44.572942 0.455957 0.867018
Argentina 9.658498 0.904571 66.336436 0.707794 0.848656

4.2 Data Scaling

Because the K-means Clustering method is scale-sensitive (the algorithm it use calculates the "distances" between variables and thus the different scales are influential!), we should standardize the data first.

In [31]:
std_data = pd.DataFrame(scale(X),index=X.index, columns= X.columns)
std_data.head()
Out[31]:
Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Perceptions of corruption
country
Afghanistan -1.383230 -2.433164 -1.438294 -1.177814 0.346557
Albania 0.024501 -0.607026 0.811286 -0.772498 0.595116
Algeria 0.276840 0.184022 0.331807 -1.176925 -0.469563
Angola -0.266536 -0.561774 -1.923250 -1.937423 0.645401
Argentina 0.421160 0.868958 0.617152 -0.062620 0.544534

4.3 Clustering

Now, we are ready to cluster!

Based on the visualization and regression result, we decide to choose three clusters. and 'k-means++' to select initial cluster centers for k-mean clustering in a smart way to speed up convergence.

In [32]:
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=669)
kmeans.fit(std_data)
std_data['cluster'] = kmeans.fit_predict(std_data)
std_data.head()
Out[32]:
Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Perceptions of corruption cluster
country
Afghanistan -1.383230 -2.433164 -1.438294 -1.177814 0.346557 0
Albania 0.024501 -0.607026 0.811286 -0.772498 0.595116 1
Algeria 0.276840 0.184022 0.331807 -1.176925 -0.469563 1
Angola -0.266536 -0.561774 -1.923250 -1.937423 0.645401 0
Argentina 0.421160 0.868958 0.617152 -0.062620 0.544534 1
In [33]:
std_data.tail()
Out[33]:
Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Perceptions of corruption cluster
country
Venezuela 0.468382 1.018980 0.365033 -0.363368 0.214688 1
Vietnam -0.571165 0.154136 0.487111 1.145390 0.175085 1
Yemen -0.759764 -0.975924 -0.891941 -0.584356 0.459840 0
Zambia -0.866873 -0.491032 -1.407057 0.191078 0.503226 0
Zimbabwe -1.479496 0.116254 -2.025435 -1.252868 0.600748 0
In [34]:
std_data.cluster.value_counts().plot.bar(fontsize = 12, title = 'Cluster', figsize = (10,10))
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c281cf160>

4.4 Cluster Visualization Using PCA

Above we get the clustering results. Most countries belong to Cluster1, and fewest countries belong to Cluster0.

But it's still hard to see the results clearly since they have too many variables(columns). Fortunately, we can take advantage of PCA to visualize the clustering results.

In [34]:
pca = PCA(n_components=3)
pc_X = pca.fit_transform(std_data) #PCA
print(pc_X.shape)
print(pca.explained_variance_ratio_)
print(pca.components_)
PC_X_df = pd.DataFrame(pc_X,columns= ['PCA1','PCA2','PCA3'],index=X.index)
PC_X_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-f16e13a6c458> in <module>()
      1 pca = PCA(n_components=3)
----> 2 pc_X = pca.fit_transform(std_data) #PCA
      3 print(pc_X.shape)
      4 print(pca.explained_variance_ratio_)
      5 print(pca.components_)

NameError: name 'std_data' is not defined

The first two principal components explain about 80% variability of the data; and the first three explain about 90%. The first three components appear good enough.

In [36]:
PC_cluster = pd.concat([PC_X_df,std_data['cluster']],axis = 1)
PC_cluster.head()
Out[36]:
PCA1 PCA2 PCA3 cluster
country
Afghanistan 3.263673 0.586694 0.901606 0
Albania 0.306625 -0.966462 0.667841 1
Algeria -0.136839 -0.546875 1.041719 1
Angola 2.473345 -0.604863 0.414008 0
Argentina -0.730871 -0.929608 -0.395686 1
In [37]:
PC_cluster.PCA1['Norway'],PC_cluster.PCA2['Norway'] # the cordinate of Norway
Out[37]:
(-3.460204450725418, 1.1253012436384977)
In [38]:
x, y = PC_cluster.PCA1, PC_cluster.PCA2

ind1,ind2,ind3 = PC_cluster.index[PC_cluster.cluster==0],PC_cluster.index[PC_cluster.cluster==1],PC_cluster.index[PC_cluster.cluster==2]
fig = plt.figure(figsize = (12,12))
ax = plt.subplot(111)  

ax.scatter(x[ind1], y[ind1], c='black',s = 100,label='cluster 0')  
ax.scatter(x[ind2], y[ind2], c='red',s = 100,label='cluster 1')
ax.scatter(x[ind3], y[ind3], c='b',s = 100,label='cluster 2')

plt.text(PC_cluster.PCA1['Norway'],PC_cluster.PCA2['Norway'],'Norway',fontsize = 20)
plt.text(PC_cluster.PCA1['Central African Republic'],PC_cluster.PCA2['Central African Republic'],'Central African Republic',fontsize = 20)
plt.text(PC_cluster.PCA1['South Korea'],PC_cluster.PCA2['South Korea'],'South Korea',fontsize = 20)

ax.set_ylabel('PCA2')
ax.set_xlabel('PCA1')
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Out[38]:
<matplotlib.legend.Legend at 0x1c28256f98>
In [33]:
x, y, z = PC_cluster.PCA1, PC_cluster.PCA2, PC_cluster.PCA3
ind1,ind2,ind3 = PC_cluster.index[PC_cluster.cluster==0],PC_cluster.index[PC_cluster.cluster==1],PC_cluster.index[PC_cluster.cluster==2]

fig = plt.figure(figsize = (12,12))
ax = plt.subplot(111, projection='3d')  

ax.scatter(x[ind1], y[ind1], z[ind1], c='black',s = 100,label='cluster 0')  
ax.scatter(x[ind2], y[ind2], z[ind2], c='r',s = 100,label='cluster 1')
ax.scatter(x[ind3], y[ind3], z[ind3], c='b',s = 100,label='cluster 2')

ax.set_zlabel('PCA3',fontsize=20)  
ax.set_ylabel('PCA2',fontsize=20,rotation=-45)
ax.set_xlabel('PCA1',fontsize=20,rotation=90)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)

plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-33-3d1de343d983> in <module>()
----> 1 x, y, z = PC_cluster.PCA1, PC_cluster.PCA2, PC_cluster.PCA3
      2 ind1,ind2,ind3 = PC_cluster.index[PC_cluster.cluster==0],PC_cluster.index[PC_cluster.cluster==1],PC_cluster.index[PC_cluster.cluster==2]
      3 
      4 fig = plt.figure(figsize = (12,12))
      5 ax = plt.subplot(111, projection='3d')

NameError: name 'PC_cluster' is not defined
  • From the 2D Scatter Plot(here I text three representative country in each group: 'Norway', '), we can see the clustering performs well: The three groups are separated clearly. And it shows that Cluster0, Cluster1, Cluster2 are corresponding to the "least happy", "moderately happy", "happiest" country groups, respectively.
  • From the 3D Scatter Plot, we can see that the clustering performance is also great.

4.5 KMeans Clustering: Pivot Table to Compare Clusters

Now, let's look at each group closely with the 2017 happiness data.

And then we gain insights into the data using Pivot Table to compare clusters.

In [40]:
hp17_rank = hp17[['Happiness.Rank','Happiness.Score']]
myPC = PC_cluster.join(hp17_rank).dropna()
myPC[myPC.cluster == 2].sort_values('Happiness.Rank').head()
Out[40]:
PCA1 PCA2 PCA3 cluster Happiness.Rank Happiness.Score
country
Norway -3.460204 1.125301 0.128002 2 1.0 7.537
Denmark -3.605642 1.903572 0.632309 2 2.0 7.522
Iceland -2.877368 -0.193715 -0.786487 2 3.0 7.504
Switzerland -3.574740 1.316670 0.555252 2 4.0 7.494
Finland -3.426964 1.617223 0.510117 2 5.0 7.469
In [41]:
myPC[myPC.cluster == 1].sort_values('Happiness.Rank').head()
Out[41]:
PCA1 PCA2 PCA3 cluster Happiness.Rank Happiness.Score
country
Israel -1.151440 -1.381517 -0.002622 1 11.0 7.213
Costa Rica -1.431192 -0.007898 -1.076286 1 12.0 7.079
Chile -0.999477 -0.595616 0.294164 1 20.0 6.652
Brazil -0.986647 0.013131 -0.437841 1 22.0 6.635
Czech Republic -1.218439 -1.130108 -0.799962 1 23.0 6.609
In [42]:
myPC[myPC.cluster == 0].sort_values('Happiness.Rank',ascending = False).head()
Out[42]:
PCA1 PCA2 PCA3 cluster Happiness.Rank Happiness.Score
country
Central African Republic 3.810935 1.759145 0.004982 0 155.0 2.693
Burundi 4.493166 0.627987 2.033011 0 154.0 2.905
Tanzania 1.628315 0.255149 -0.562945 0 153.0 3.349
Rwanda 0.752087 3.812362 1.144188 0 151.0 3.471
Togo 3.598597 0.966464 1.062436 0 150.0 3.495

The three tables give the PCAs of the three clusters and their happiness ranks and scores.

In [43]:
final_df = country_index_year_average.join(myPC)
final_df.head()

pivot_table_cluster = final_df.pivot_table(index='cluster', values=['Happiness.Rank','Happiness.Score','Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Perceptions of corruption', 'PCA1', 'PCA2','PCA3'],aggfunc = np.mean)
pivot_table_cluster

cluter_index_summary = pivot_table_cluster.sort_values('Happiness.Rank')[['Happiness.Rank','Happiness.Score','Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Perceptions of corruption', 'PCA1', 'PCA2','PCA3']]
cluter_index_summary
Out[43]:
Happiness.Rank Happiness.Score Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Perceptions of corruption PCA1 PCA2 PCA3
cluster
2.0 20.629630 6.834889 10.683349 0.924934 70.222853 0.885061 0.447729 -2.939115 0.805515 0.305835
1.0 70.240506 5.526734 9.498370 0.832774 64.169662 0.690961 0.832806 -0.255923 -0.680073 -0.082095
0.0 129.093023 4.083512 7.624338 0.675969 50.669898 0.656097 0.787266 2.181057 0.660879 0.005282

The pivot table give a summary for us to compare the clusters.

It appears clearly that: The happiest group: have the higgest Log GDP per capital, Healthy life expectancy, Freedom to make life choices and the trust of government(the smaller 'Perceptions of corruption' is, the more confidence they have in the government).

The least happy group: have the lowest Log GDP per capital, Healthy life expectancy, Freedom to make life choices and the trust of government(the smaller 'Perceptions of corruption' is, the more confidence they have in the government).

Thus, we can draw a conclusion: happiness score is highly and positively correlated with the Log GDP per capital, Healthy life expectancy, Freedom to make life choices and the trust of government.

5.Clustering